ABSTRACT


This paper presents an analysis of software related system failures on
the IBM 3081 at SLAC. We find three broad categories of failures: error
handling, control or logic problems, and hardware-related. A statistical
analysis shows (not unexpectedly) a decreasing failure rate with time.
This is especially true in the early part of the study. Notwithstanding
the decreasing failure rate with time, we find that the occurrence
of failures is strongly correlated with the type and level of workload
prior to the occurrence of a failure. For example, it is shown that the
risk of a software related failure increases in a non-linear fashion
with the amount of interactive processing, as measured by paging rate
and system overhead. The paper employs a statistical model to describe
the load dependency and offers explanations for the observed phenomenon.

Keywords: software reliability, workload, statistical failure models,
data analysis.
1. INTRODUCTION


The highly interactive and diverse nature of modern day systems has made
high reliability a central issue in computer system design. Most
researchers in the area would agree that it is not feasible to guarantee
a perfect system, either in hardware or in software. Accordingly,
depending on the nature of the application, it is important to design
into the system the ability either to continue operation in the event of
a failure or to react to a failure in a predictable manner.

Designing hardware systems that tolerate faults is relatively well
understood, at least from a theoretical viewpoint. However, the problem
of software fault tolerance (especially the question of hardware/software
interaction) has yet to be well understood [Hecht 80]. A reason
for this is that neither the error generation process nor the prediction
problem is easy to comprehend, although the SIFT studies have been an
important contribution [Wensley 78] [Melliar-Smith 81].

Theoretical models can only deal with a restricted class of problems.
Most often it is the problems outside the range of theoretical models
which cause the most severe malfunctions. Accordingly, at this stage
there is no better substitute for results based on actual measurements
and experimentation [Curtis 80]; such results are few and far between
[Denning 80].

This paper presents results of one such analysis conducted on the
VM/370 operating system on the IBM 3081 at the Stanford Linear
Accelerator Center (SLAC) computation facility.

The Stanford Linear Accelerator Center is engaged in the study of
high energy particle physics. A two mile long linear accelerator and
the associated real-time data network provide a vast amount of physical
data for analysis. The SLAC installation (which was reconfigured in
February 1981) consists of three processors. There are two IBM 370/168
processors, running the OS/VS2 (SVS) operating system, that provide
mainly batch service. The last is an IBM 3081 exclusively running the
VM/370 (Virtual Machine) operating system. During a typical day, the
3081 complex has approximately 150 time-sharing users with a sizeable
compute-bound background load. Although there is some communication
with the older system, for practical purposes the 3081 is run as an
independent system.

Our general objective was to study the causes of system failures, due
to software, in a fully operational production environment. We intended
not only to investigate the effect of persistent bugs in reasonably
mature software systems, but also to study the interactions with the
rest of the system. In particular we wished to consider the following
questions:

1. What are the most common types of software related failures and
their relative frequencies?

2. Are there any identifiable failure patterns that occur most
often? For example, is an inadequate hardware-software interface
a frequent cause of system failure?

3. Is there a relationship between operating system failures and the
usage environment as represented by various measures of system
activity?

4. What inferences can be drawn from our analysis in relation to
both the design and testing of large software systems?



Our general approach is to assume no model a priori, but rather to start
with a substantial amount of high quality data on software related
failures and system activity. We report on statistical trends and
relationships found in the data with the aim of discovering an underlying
model. The experience gained and the models found would, in our view,
provide valuable insight into the question of fault-tolerant design in
general and software design in particular. The next section describes the
basis for this work in detail and places related studies in perspective.

2.  BASIS  AND  PERSPECTIVE 

The term "software reliability model" is usually taken to mean
mathematical models for assessing the reliability of software (in terms of
statistical parameters such as MTBF)1 during the development, debugging or
testing phases, although a few of these models have also been applied in
follow-up operational phases. In this context there have been two
distinct approaches to studying software reliability. In the first,
software reliability is defined in a manner similar to hardware reliability.
Several competing models have appeared in the literature [Musa 1980],
and a number of authors have attempted to analyze their suitability; an
appreciation of the extent and nature of this discussion can be obtained
from [Goel 80]. There is, perhaps, some evidence to suggest that the
hardware analogy may have been carried too far [Littlewood 80].

The second approach attempts to exploit the close relationship
between software quality measures (e.g. complexity) and reliability.
The parameters of these models are the attributes of the programs to be
studied. There are many measures of software quality, the most well
known being those proposed by [Boehm 78] and [McCall 77]. Both
techniques are developed from an intuitive clustering of primitive quality
measures. The main difficulty with these approaches is that, although
there is agreement on what should be measured, there is little agreement
on how best to evaluate and measure them in practice. Finally, even
though each model appears to be valid within its own assumptions, there
is insufficient experimental evidence available for its large scale
validity.

1 Mean Time Between Failures

Research most closely related to the present study is in the area of
analysis of errors and their causes in large software systems. [Endres
75] discusses and categorises errors and error frequencies during the
internal testing phase of the IBM DOS/VS system. [Hamilton 78] applies
the well known execution time model [Musa 80] to measure the operational
reliability of computer center software, and [Glass 80] examines the
occurrence of persistent bugs and their causes in operational software.
Another useful study is [Maxwell 78], which tabulates and examines
error statistics on software.

None of these studies tries to relate system reliability or the error
frequencies to the usage environment of the software itself in a
systematic manner. Results based on such measurements are essential if a
scientific basis is to be developed for software reliability evaluation.
The argument for adopting a particular approach is more convincing if
backed by experiments demonstrating its usefulness.

The operational phase of mature software is somewhat different from
the development, debugging, and testing phases. A typical situation is
one where frequent changes and updates are installed either by the
installation programmers or by the vendor. Often the vendor will
install a change to fix an error found at some other installation,
without any notification to the installation management. In a sense the
system being measured represents an aggregate of all such systems
maintained by the vendor.

An experimental study therefore provides not only a view of the end
product but also gives some insight into the persistent problems. This
information can be valuable both in designing new systems and in
developing testing strategies for new releases.

In an early study of failures on the SLAC Triplex system2 [Iyer 82]
found a strong correlation between the occurrence of failures (both
hardware and software) and the load on the system as measured by
variables such as the paging rate and the jobstep processing rate. All
failures were considered, not simply the ones which led to system
service interruptions. Most importantly the effects were such that the
average failure rate for both hardware and software components varied
cyclically over a band of significant width as determined by the daily
load variations. Fig. 1 below is a representative histogram from that
study of all software failures plotted by the hour of day, averaged over
1978.

A more detailed and accurate analysis on a different system was
considered necessary before such results could be considered
representative. The VM/370 system on the IBM 3081 at SLAC (in service since
February 1981) provided an ideal opportunity in this respect. We commence
by describing our measurements of failures and system activity.


2 At the time of the previous study, the SLAC system consisted of two
IBM 370/168s and one IBM 360/91 configured in a triplex mode.

SLAC Component Failure Profiles


Figure 1: Software failures by hour of day (SLAC Triplex).

3.  SOURCES  OF  DATA 

As explained in the Introduction, we wished to study the occurrence of
failures in relation to the system activity at the time of the failure.
To begin with, we restricted our analysis to all abnormal terminations
of the system. The data on these events came from the system IPL
(Initial Program Load) log, automatically recorded by the operating system.
Data on system utilization and performance came from the VM accounting
and performance system. To avoid the collection of misleading startup
data, the first month of normal operation was ignored.
3.1 FAILURE DATA

Failure data for this study originates in an automatically collected log
of all IPL's of the system, both scheduled and non-scheduled. For the
non-scheduled IPL's the problem is investigated and a determination of
the cause of failure is made. This may involve hardware repair or the
study of a system dump. Finally, the manager of systems or head system
programmer enters the cause of failure. On the basis of their
determination the failures were tagged using the following categories:

1. Hardware (HW) - A hardware failure caused the crash.

2. Software (SW) - A software failure caused the crash. If both
hardware and software were involved, then the HS category is also
indicated.

3. Operator induced (OPR) - Any human error.

4. Unknown (UNK) - Nothing could be blamed.

5. Repeat (RPT) - A reoccurrence of a previous failure.

When software is involved, the following additional categories are
defined:

1. CTL - A control or logic problem.

2. ERH - An error handling problem.

3. HSE - A hardware error-handling problem due to lack of robustness
in the software.

4. TIM - A timing related problem.

A  sample  of  the  online  failure  data  base  created  on  this  basis  is  shown 
b) TTIME - Total processor time (VTIME plus overhead, fraction of
two processors).

2. Interactive Execution

a) PAGEIN - Total number of page reads (per second).

b) PAGEOUT - Total number of page writes (per second).

c) SIO (Start I/O) - Total number of non-spooled input/output
operations (per second).

3. Others

a) OVERHEAD - Demands placed on the operating system by user jobs
(TTIME - VTIME).

b) PRINT - Total number of virtual lines printed (per second).

c) PUNCH - Total number of virtual cards punched (per second).

d) READER - Total number of virtual cards read (per second).

3.3 MATCHING SOFTWARE FAILURES AND SYSTEM ACTIVITY

In order to analyze the level and nature of system utilization in an
accurate and efficient manner, a uniform data base containing the values
of all the workload variables prior to the occurrence of a failure was
created.

The first step was to create 5-minute time averages for all workload
parameters for the entire period of our study (March 1981 through April
1982). A sample of this data for PAGEIN and TTIME appears in Fig. 3.
For matching purposes, the workload in a specified interval prior to the
failure was combined with the failure point. This is illustrated in
Fig. 4. After some experimentation the average load in a one hour
interval prior to the failure was found to be the most suitable.


3 The 3081 is a "dyadic" or dual processor. Full utilisation is defined
to be 2.0.


Figure 3: A one day sample, PAGEIN and TTIME (panels: PAGEIN/sec and
TTIME (fraction), each plotted against time of day)


Figure 4: Matching failures and workload
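The matching step described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the software system actually used in the study: the function names, the `(timestamp, value)` sample layout, and the synthetic PAGEIN readings are all assumptions made for the example. It forms 5-minute averages and then takes the mean of those averages over the hour preceding a failure.

```python
from bisect import bisect_left

def five_minute_averages(samples, bin_seconds=300):
    """Average (timestamp_seconds, value) samples into fixed 5-minute bins.

    Returns a sorted list of (bin_start_seconds, mean_value).
    """
    sums = {}
    for t, value in samples:
        start = int(t // bin_seconds) * bin_seconds
        total, count = sums.get(start, (0.0, 0))
        sums[start] = (total + value, count + 1)
    return sorted((start, total / count) for start, (total, count) in sums.items())

def load_before_failure(bins, failure_time, window_seconds=3600):
    """Mean of the binned workload over the window preceding a failure.

    Only bins starting inside [failure_time - window, failure_time) are used.
    Returns None when no workload data falls in the window.
    """
    starts = [start for start, _ in bins]
    lo = bisect_left(starts, failure_time - window_seconds)
    hi = bisect_left(starts, failure_time)
    window = [value for _, value in bins[lo:hi]]
    return sum(window) / len(window) if window else None

# Hypothetical PAGEIN readings: one per minute for two hours,
# light load in the first hour and heavier load in the second.
samples = [(60 * m, 10.0 if m < 60 else 30.0) for m in range(120)]
bins = five_minute_averages(samples)
matched = load_before_failure(bins, failure_time=7200)
```

With a failure placed at the end of the second hour, `matched` reflects only the heavier second-hour load, which is the intent of pairing each failure with the workload that immediately preceded it.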
The creation of these data bases required complex processing in order
to minimize the loss of information that invariably accompanies such
procedures. The software system developed for this purpose is discussed
in [Rossetti 81]. The system is highly interactive and allows efficient
handling of large amounts of data of varying formats and complexities.

4.  SOFTWARE  FAILURE  CHARACTERISTICS 
We commence our analysis by tabulating some example failure statistics.
As a first step we compared our results with those obtained for failures
at SLAC on the Triplex configuration [Butner 80]. This comparison is
shown in Table 1. It is clear that in terms of MTBF, the new system is
at least twice as reliable as the Triplex configuration.


TABLE 1

Mean Time Between Failure Comparison
SLAC Triplex and SLAC 3081 (in hours)

Failure      Triplex     Early 3081       Late 3081
Type         (1978)      (Mar81-Jun81)    (Sep81-Apr82)

All           23.19         40.41            69.22
Software      33.10         68.29           110.19
Hardware      90.28         90.79           183.60
Table 2 provides more detailed time-between-failure (TBF) and
time-to-repair (TTR) statistics for the 3081 system. The columns correspond
to mean, standard deviation, minimum, and maximum values for each
measure. The results are also broken down by the major failure categories
defined in Section 3.1, and are identified by row. We also divided the
time period in our study into two parts: an initial period from March
1981 through August 1981, and a more recent period from September 1981
to April 1982. There are two reasons for this: first, we expected a
lower MTBF in the early part of system life than in the later;
secondly, the system load began to stabilize and reach peak values much
more often during the second part of the study. In interpreting Table
2, note that the sub-categories are not mutually exclusive. A software
failure (SW) could also be flagged with ERH if error handling was judged
to be part of the problem.


TABLE  2 

Failure  and  repair  statistics  (SLAC  3081) 
In the TBF columns, it is interesting to see the dramatic improvement
in reliability between the early and late data. For example, hardware
reliability more than doubled and software reliability improved by
almost 50 percent. From the TTR columns we can see that, as expected,
hardware failures cause longer down times (avg. 1 hour) than software
failures (avg. 0.42 hours). The table also shows that the longest down
period was 19.25 hours, and that it was due to a hardware problem.

Table 3 below gives all the unique failure patterns, with their
frequency of occurrence. We notice that failures where only hardware was
involved account for just 20% of all system reloads, while over 50% of
all cases relate to a software problem. That 50% includes situations
where both hardware and software were involved (15% of the total). In
most of the hardware/software cases (approximately 12% of all failures)
a common scenario was that a hardware failure made the system go into a
region of the software which was not sufficiently robust to handle the
problem (a "hardware/software error handling problem"). Of the
remaining software failures, control problems (35% of all failures) and error
handling problems (27% of all failures) dominate. Synchronization and
timing window problems account for 10%. We also notice a rather large
share of repeats - approximately 15%.
4.1 STATISTICAL TESTS

In common with other analyses of this type, our first test was to
investigate the distribution of the time between failures. A
Kolmogorov-Smirnov test conducted on the time between failure distribution
accepted at any level of significance a Weibull distribution with the
following density function and parameters:

$$f(t) = \alpha \beta \, t^{\beta - 1} \exp\left[-\alpha t^{\beta}\right]$$

where: α = 0.092 (characteristic life of 40.0 hours)

β = 0.647


Since β < 1.0, this is clearly a distribution with a decreasing failure
rate in time. The empirical and Weibull cumulative distribution
functions are plotted in Fig. 5. This also conforms with the plot of the
monthly average failure rates in Fig. 6.
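As a quick numerical check of the fitted distribution, the sketch below evaluates a Weibull density of the form f(t) = αβ t^(β-1) exp(-α t^β). It assumes α = 0.092 and β = 0.647, a reading of the parameters that reproduces the stated characteristic life of about 40 hours; the function names are illustrative only. The hazard f(t) / (1 - F(t)) reduces to αβ t^(β-1), which falls with t whenever β < 1.

```python
import math

ALPHA = 0.092   # assumed scale parameter of the fit
BETA = 0.647    # assumed shape parameter of the fit (less than 1)

def weibull_pdf(t, alpha=ALPHA, beta=BETA):
    """Density f(t) = alpha * beta * t**(beta - 1) * exp(-alpha * t**beta)."""
    return alpha * beta * t ** (beta - 1.0) * math.exp(-alpha * t ** beta)

def weibull_cdf(t, alpha=ALPHA, beta=BETA):
    """Cumulative distribution F(t) = 1 - exp(-alpha * t**beta)."""
    return 1.0 - math.exp(-alpha * t ** beta)

def failure_rate(t, alpha=ALPHA, beta=BETA):
    """Hazard h(t) = f(t) / (1 - F(t)) = alpha * beta * t**(beta - 1)."""
    return alpha * beta * t ** (beta - 1.0)

def characteristic_life(alpha=ALPHA, beta=BETA):
    """Time at which alpha * t**beta equals 1, i.e. F(t) = 1 - 1/e."""
    return alpha ** (-1.0 / beta)

eta = characteristic_life()
rates = [failure_rate(t) for t in (1.0, 10.0, 100.0, 1000.0)]
```

Here `eta` comes out very close to 40 hours, and the entries of `rates` shrink as the time since the last failure grows, which is the decreasing failure rate the text describes.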


Figure 5: Theoretical and empirical Weibull distributions (cdf)

Figure 6: Average software failure rate (by Month)


5.  SOFTWARE  FAILURES  -  DEPENDENCE  ON  SYSTEM  ACTIVITY 
In this section we attempt to relate the occurrence of software failures
to the type of system utilization as measured by the various workload
variables. It is envisioned that such experiments would provide insight
into the most probable "cause and effect" relationships for software
failures.


It is to be expected that most workload measures will be cyclic on a
daily basis. Accordingly, it was instructive to examine the mean
software failure rate behavior over the same period [Beaudry 78]. This
provided not only a quick visualization of significant failure trends but
was also useful in developing subsequent statistical experiments.

Figure 7 shows a histogram of failures by hour of day for all software
related failures. Note the sharp rise in failure rate during early
morning hours; this appears to follow rather closely the typical
interactive load at SLAC and also compares favorably with a similar plot of
software failures for the SLAC Triplex configuration (Fig. 1).




5.1 DISTRIBUTIONS OF ERRORS AND SYSTEM ACTIVITY

As explained in Section 3.1, our data provided us with a set of system
workload measures. In particular, measures such as the paging (PAGEIN
and PAGEOUT) and input/output (SIO) rates provide a measure of the
system interactive load, while measures such as TTIME and VTIME provide a
general view of the CPU usage. The variable "OVERHEAD", derived from
the difference between TTIME and VTIME, is a direct measure of the
demands being placed on the operating system by users' programs
(actually virtual machines).

Recall that the data base developed contains not only the values for
the specified workload variables to a five minute resolution but also
the values of the same variables matched with failure times. From this
data three types of distributions were generated. The first, ℓ(x), is
simply the distribution of the workload in question:

$$\ell(x) = \Pr\{\text{Workload Measure} = x\}$$

The second is the joint distribution of the failure and the workload
measure:

$$f(x) = \Pr\{\text{Failure Occurs and Workload} = x\}$$

This is easily obtained from the failure matched data base, the
generation of which was described in Section 3.1.

In f(x) both the failures and the workload measures are represented
as they occur in the system. Clearly the more favored values for a
given workload will contribute more to this distribution than the less
favored ones.* Using the well known notion of conditional probability, we
define:

$$g(x) = \Pr\{\text{Failure Occurs} \mid \text{Workload} = x\} = \frac{f(x)}{\ell(x)}$$

g(x) can now be thought of as the probability of a software failure at a
given value for workload when all values are equally represented.
Figure 8 shows the plots for ℓ(x), f(x), and g(x) for three selected
workload variables: PAGEOUT, SIO, and TTIME. All software failures are
considered; see the appendix for other variables and subclasses.
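The three distributions can be estimated from binned data along the following lines. This is a hedged sketch rather than the procedure used in the study: the function name, the equal-width binning, the choice to normalize both the workload and the failure counts by the number of observation intervals, and the sample data are all assumptions introduced for illustration.

```python
from collections import Counter

def load_distributions(all_loads, failure_loads, bin_width):
    """Estimate l(x), f(x) and g(x) = f(x) / l(x) on equal-width load bins.

    all_loads     -- one workload value per observation interval
    failure_loads -- the workload value matched to each failure
    Returns {bin_lower_edge: (l, f, g)}. Failures matched to a load bin
    with no observations are ignored, since g is undefined there.
    """
    n = len(all_loads)

    def bin_of(x):
        return int(x // bin_width) * bin_width

    load_counts = Counter(bin_of(x) for x in all_loads)
    fail_counts = Counter(bin_of(x) for x in failure_loads)
    result = {}
    for edge in sorted(load_counts):
        l = load_counts[edge] / n           # Pr{Workload in bin}
        f = fail_counts.get(edge, 0) / n    # Pr{Failure and Workload in bin}
        result[edge] = (l, f, f / l)        # g = Pr{Failure | Workload in bin}
    return result

# Hypothetical data: 100 intervals, mostly at light load, with as many
# failures observed at heavy load as at light load.
all_loads = [2] * 80 + [12] * 20
failure_loads = [2] * 4 + [12] * 4
dist = load_distributions(all_loads, failure_loads, bin_width=10)
```

In this made-up sample f(x) is the same in both bins, yet g(x) is four times larger in the heavy-load bin because heavy load is observed far less often. This is the same correction as in the automobile analogy: divide the failure count by how often the load level actually occurs.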

As a general observation we note that, where the difference between
ℓ(x) and f(x) is considerable, we might expect to see a workload
dependency in the failures. If ℓ(x) and f(x) are similar, the relationship
is probably not significant. A g(x) distribution sharply weighted in
favor of higher workload values will clearly generate a higher risk of
failure as the load increases.

It would appear from the g(x) plots for PAGEOUT and SIO that higher
values of these measures (> 10 for PAGEOUT) contribute more
significantly to software related failures than the lower values. Examining
the plots for TTIME we note that, as measured by CPU utilization, the
system was heavily loaded (close to 2.0) most of the time. The ℓ(x) and
g(x) plots for TTIME show considerable similarity. It would therefore
appear from this cursory analysis that failures are not induced by
higher execution rates, as measured by CPU usage.

*  A  rather  commonplace  analogy  to  illustrate  this  is  that  automobiles 
travelling  at  150  mph  have  a  higher  probability  of  an  accident  than 
those  travelling  at  55  mph.  There  are  however  far  fewer  accidents  at 
150  mph.  To  obtain  an  accurate  representation  of  the  risks  involved 
in  travelling  we  must  divide  the  number  of  accidents  at  high  speeds  by 
the  number  of  autos  travelling  at  that  speed. 
Figure 8: Frequency distributions: ℓ(x), f(x), and g(x)
In order to quantify this effect, in particular to determine exactly
the risk or "hazard" associated with higher workload values, we employed
what we refer to as a "load hazard" model, the development and
application of which is discussed in the next section.

5.2 A SOFTWARE LOAD HAZARD MODEL

The object of our analysis was to determine:

1. Does a higher level of system utilization result in a higher risk
of failure than a lower level?

2. Is the relationship linear with the workload variables, or is
there a nonlinear increasing effect?

In practical terms, if such an effect exists, we expect the load to act
as a stress factor. For this purpose we developed and validated a
load-hazard model which formed the basis for our tests. A detailed
description of the development and validation of this model is given
in [Iyer 82]. Briefly, an inherent load hazard z(x) is defined as

$$z(x) = \frac{\Pr\{\text{Failure in load interval } (x, x+\Delta x)\}}{\Pr\{\text{No failure in load interval } (0, x)\}} = \frac{g(x)}{1 - G(x)} \qquad (1)$$

where:

g(x) is as defined in section 5.1, and

G(x) is the cumulative distribution function of g(x).

In close analogy with the classical hazard rate in reliability
theory [Shooman 68], z(x) measures the incremental risk involved in
increasing the workload from x to x+Δx. If z(x) increases with x, it
should be clear that there is an increasing risk of a failure as the
workload variable increases. If, however, z(x) remains constant for
increasing x, we may surmise that no increased risk is involved.
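A discrete version of the load hazard z(x) = g(x) / (1 - G(x)) can be sketched as follows. The discretization is an assumption made for this example, not taken from the paper: g(x) is normalized to sum to one over the load bins, and G is accumulated over the bins strictly below the current one. The function name and the sample values are hypothetical.

```python
def load_hazard(g_values):
    """Discrete load hazard z_i = g_i / (1 - G_i) over ordered load bins.

    g_values -- conditional failure probabilities g(x), in order of
                increasing load
    G_i is the normalized cumulative sum of g over bins strictly below
    bin i, so 1 - G_i is the share of g not yet accounted for at lower loads.
    """
    total = sum(g_values)
    if total <= 0:
        raise ValueError("g(x) must have positive total mass")
    hazards = []
    cumulative = 0.0
    for g in g_values:
        g_norm = g / total
        hazards.append(g_norm / (1.0 - cumulative))
        cumulative += g_norm
    return hazards

# Hypothetical g(x) that is weighted toward the higher load bins.
rising = load_hazard([0.05, 0.05, 0.10, 0.30, 0.50])
```

Under this discretization the highest bin always has a hazard of 1, because all remaining mass sits there, so it is best left out when judging whether z(x) rises with load.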
Note that in our definition of load hazard we have removed the
variability of system load by using the conditional probability g(x). This
of course is not true in practice since load is best described as a
random variable with a probability distribution; it is simply the
associated load distribution, ℓ(x), defined above. In order to determine the
hazard for a particular load pattern, we must multiply the associated
load probability by the hazard calculated in (1). Denoting by z_a(x)
the transformed hazard, we have

$$z_a(x) = z(x)\,\ell(x) \qquad (2)$$

We refer to the hazard z(x), as defined in (1), as the fundamental
hazard. This is because it can be thought of as an inherent property of
a particular system and is not subject to varying load patterns. When a
varying load pattern is taken into account, it can be thought of as
"picking out" aspects of the fundamental hazard function. This hazard
z_a(x) defined in (2) will be referred to as the apparent hazard, since
it is closely dependent on the load distribution.

The following example illustrates how a particular workload can
modify a given fundamental load hazard z(x). Figure 9(a) shows a sample
fundamental hazard z(x). Note that z(x) is increasing with load. Thus,
if all load values are equally likely, the system has a higher risk of
failure at higher load values than at lower load values. Figure 9(b) is
a hypothetical load distribution where the load variable is the
fractional CPU utilization, with 0 for an idle CPU and 1 for a fully busy
CPU. Finally, Fig. 9(c) gives the apparent hazard due to the effect of
the load distribution in (b). The apparent hazard is now decreasing
simply because higher load values are less probable.
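The effect described in this example can be reproduced with a few made-up numbers. The sketch below multiplies a fundamental hazard by a load distribution bin by bin, as in the definition of the apparent hazard; the function name and all values are hypothetical and chosen only so that the fundamental hazard rises while the load distribution falls off quickly.

```python
def apparent_hazard(z_values, load_probs):
    """Apparent hazard: fundamental hazard times load probability, per bin."""
    if len(z_values) != len(load_probs):
        raise ValueError("z(x) and l(x) must cover the same load bins")
    return [z * l for z, l in zip(z_values, load_probs)]

# Hypothetical values on five CPU-utilization bins, from idle to busy.
z = [0.02, 0.03, 0.05, 0.08, 0.12]   # fundamental hazard, increasing
l = [0.50, 0.25, 0.13, 0.08, 0.04]   # load distribution, mostly light load
za = apparent_hazard(z, l)
```

Although `z` increases across the bins, `za` decreases, because the heavily loaded bins are visited too rarely for their higher inherent risk to show up in the observed failure pattern.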
(a) Fundamental Hazard   (b) Load Distribution   (c) Apparent Hazard


Figure 9: Example of fundamental and apparent hazards

5.3  HAZARD  PLOTS 

The generation of the hazard plots and associated statistics involved
extensive data processing. In each hazard plot, z(x) or z_a(x) is
calculated and plotted as a function of a chosen workload variable, x. In
developing hazard plots for the load-failure data, those factors not
related to load are expected to behave as noise in a load-failure
analysis. If such factors are predominant, we can expect to find no
discernible pattern in our hazard plots, i.e. they should appear as
uncorrelated clouds.

An easily discernible pattern, on the other hand, would indicate that
the load-failure dependency dominates others. The strength of such a
relationship can be measured through regression. Figs. 10, 11, and 12
depict the hazard plots for three selected load parameters (PAGEIN,
SIO, and OVERHEAD). These plots relate to all software failures;
see the appendix for other variables and subclasses. The regression
coefficient R², which is an effective measure of the goodness of fit,
is provided for each plot. Quite simply, it measures the amount of
variability in the data that can be accounted for by the regression
model. R² values of greater than 0.6 (corresponding to an R > 0.75)
are generally interpreted as strong relationships [Younger 79].*
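To make the role of R² concrete, the following sketch fits a least-squares line and reports the fraction of variance it explains. Fitting the logarithm of the hazard against load is an assumption made here for illustration; the exact regression form used in the study is not stated in this section. The function name and the data points are hypothetical.

```python
import math

def linear_fit_r2(xs, ys):
    """Least-squares line y = a + b*x and its coefficient of determination."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)   # squared correlation coefficient
    return intercept, slope, r2

# Hypothetical hazard values at increasing load, roughly exponential in x.
load = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
hazard = [0.010, 0.016, 0.024, 0.041, 0.062, 0.100]
intercept, slope, r2 = linear_fit_r2(load, [math.log(z) for z in hazard])
```

A positive `slope` together with an `r2` well above 0.6 would be read, by the rule of thumb quoted above, as a strong increasing relationship between load and hazard. An uncorrelated cloud of points would instead give an `r2` near zero.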

It would appear from our data that many of the workload parameters
are acting as a stress factor, i.e., that there is an increasing risk of
failure with increasing load. In the case of the interactive workload
measures OVERHEAD and SIO there is no doubt that, statistically, there
is an increased risk of a software failure as the load increases. The
correlation coefficients of 0.95 and 0.91 show that a very close fit was
obtained and that the failures closely fit an increasing load-hazard
model. The risk of a failure also appears to increase with increased
PAGEIN, although at a somewhat lower correlation (R = 0.82).
Importantly, we note that:

1. We are not seeing a statistically higher failure rate simply due
to greater execution. With CPU usage (TTIME) as a measure, one
finds that the correlation is unacceptable, i.e., that no
relationship exists. This would appear further proof that simply a
greater execution rate (as measured by CPU utilization) is not a
major cause of the observed failures.

2. The relationship is highly non-linear, i.e. the risk of a failure
markedly increases as workload variables reach peak values. This
tends to indicate that there is a complex set of interactions
that adversely affects the operating system as end points are
reached.
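
To make the idea of a load-dependent hazard concrete, the following sketch estimates, for each load interval, the number of failures observed at that load divided by the number of load samples observed at that load. This is only one simple estimator and is not claimed to be the one used in this paper; the function name, bin edges, and all data values are hypothetical.

```python
from bisect import bisect_right

def empirical_hazard(load_samples, failure_loads, bin_edges):
    """Per load bin: failures seen in the bin / load samples seen in the bin."""
    n_bins = len(bin_edges) - 1

    def count(values):
        counts = [0] * n_bins
        for v in values:
            i = bisect_right(bin_edges, v) - 1
            if 0 <= i < n_bins:      # ignore values outside the bin range
                counts[i] += 1
        return counts

    exposure = count(load_samples)
    failures = count(failure_loads)
    # A bin that was never visited has no defined hazard.
    return [f / e if e else None for f, e in zip(failures, exposure)]

# Hypothetical observations: the system spends most of its time at low
# load, yet most failures occur at high load.
load_samples = [10] * 50 + [30] * 30 + [50] * 15 + [70] * 5
failure_loads = [10, 30, 30, 50, 50, 50, 70, 70, 70, 70]
hazard = empirical_hazard(load_samples, failure_loads, [0, 20, 40, 60, 80])
print(hazard)
```

With these invented numbers the estimated hazard rises much faster than the load itself, which is the kind of non-linear behavior described in point 2.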


* The range of |R| from 0 to 1 is typically divided as follows: (0,
0.25) moderately weak; (0.25, 0.5) moderate; (0.5, 0.75) moderately
strong; (0.75, 1.0) strong.
Figure 12: Hazard plot: OVERHEAD

The vertical scale is logarithmic in these plots, indicating that the
hazard is rising sharply at peak loads.

In the next section we provide conjectures on the possible causes of
this dependency and provide further interpretation of our results.

6. VIEWS ON OBSERVED RESULTS

The question of fault tolerant design has been studied by a number of
authors (see for example [Hecht 80, 80b] and [Yau 80]). In order to be
failure tolerant, the software must be able to deal with adverse effects
in a well defined manner. The ideal situation is one where an error is
detected at the earliest time, thus containing the impact of the error
to its minimum.

It is clear from our analysis that, in practice, we are far from this
ideal, even in a well structured system such as VM/370. What we observe
is that often the most severe malfunctions occur when the workload
becomes more complex. Due to the extensive degree of inter-user and
user-system dependency under these circumstances, it is usually not
possible to contain the impact of the error and a system crash ensues.

Each of the following sections discusses a particular way that system
failures are thought to relate to the quality and quantity of workload.
Examples of typical SLAC failures are given in each case. Note that a
detailed analysis was performed on every system crash to determine the
exact cause of failure. This was done by careful tracking and record
keeping by the SLAC system support staff.
6.1 SYSTEM DESIGN ASSUMPTIONS

At numerous points in the design and coding of system components
implicit and explicit assumptions are made about the environment the
component will be subjected to. The three cases below characterize the
most popular types of assumption-related failures.

Queue, Buffer, and Table Limits: Usually these are found only during
extreme situations, where load is abnormally high or unusual.

Example: Recently the number of users logged onto the SLAC VM/370
system went above 250. A system component failed when its table could
hold only 250 entries. The result was not catastrophic, but it did
affect system monitoring.
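
A minimal sketch of the defensive alternative follows: a fixed-capacity table that makes its limit explicit and reports a rejected entry to the caller instead of failing. The class name BoundedTable and the user names are hypothetical; nothing here describes the actual VM/370 component.

```python
class BoundedTable:
    """Fixed-capacity table that refuses new keys once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.rejected = 0           # how often the limit was hit

    def add(self, key, value):
        """Return True if stored, False if the table is full."""
        if key not in self.entries and len(self.entries) >= self.capacity:
            self.rejected += 1      # record the event so it can be monitored
            return False
        self.entries[key] = value
        return True

# Hypothetical use: 251 users log on to a table sized for 250.
table = BoundedTable(capacity=250)
results = [table.add(f"user{i}", {"logged_on": True}) for i in range(251)]
print(sum(results), table.rejected)
```

The caller can then decide how to degrade, for example by omitting the extra user from monitoring reports, rather than discovering the limit through a component failure.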

Synchronization Assumptions: These assumptions are not usually
explicit. They are most often due to the programmer or designer not
being able to consider that a certain sequence of events could occur as
slowly or quickly as might happen under extreme conditions.

Example: In a recent case a user waited a long time between typing
his user id and his password while logging on. During that period the
user directory (containing password information) was updated by other
users (changing passwords, etc.). When the password was finally
entered, the system crashed because the logon process was using outdated
pointers into the directory data structure.
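
One common way to guard against this kind of stale reference is to stamp each reference with a version number and re-validate it before use. The sketch below illustrates the idea only; the Directory class, its methods, and the retry policy are assumptions made for the example and do not describe the VM/370 directory.

```python
class StaleHandleError(Exception):
    """Raised when a handle predates the latest directory update."""

class Directory:
    def __init__(self):
        self._passwords = {}
        self._version = 0           # bumped on every update

    def update(self, user, password):
        self._passwords[user] = password
        self._version += 1

    def open_handle(self, user):
        # Remember which version of the directory the handle refers to.
        return (user, self._version)

    def check_password(self, handle, attempt):
        user, seen_version = handle
        if seen_version != self._version:
            raise StaleHandleError(user)
        return self._passwords.get(user) == attempt

def logon(directory, user, read_password):
    """Open a handle, wait for the password, and retry once if stale."""
    handle = directory.open_handle(user)
    attempt = read_password()       # may take arbitrarily long
    try:
        return directory.check_password(handle, attempt)
    except StaleHandleError:
        fresh = directory.open_handle(user)
        return directory.check_password(fresh, attempt)
```

Here a slow user causes at most a retry with a fresh handle; the outdated reference is detected rather than followed.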

Unanticipated State Changes: Many of the bugs discovered relate to
an operation that is somehow preempted by an external event. Typically
a critical section was not adequately protected from the event and a
data structure or program was forced into an inconsistent state.

Example: A number of these bugs have involved a sudden logoff during
an operation being performed on a user's virtual machine. The operation
is only partly completed and a system data structure is made unusable.
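
A simple way to picture the protection that was missing is an all-or-nothing update: the operation works on a private copy and the result is published only if every step completes. The sketch below is a single-threaded illustration under that assumption; the Logoff exception, the function name, and the table contents are hypothetical, and in a real control program the final publishing step would itself have to be protected from interruption.

```python
import copy

class Logoff(Exception):
    """Hypothetical external event that interrupts an operation."""

def apply_all_or_nothing(table, steps):
    """Run every step on a private copy; publish the copy only on success."""
    working = copy.deepcopy(table)
    for step in steps:
        step(working)               # may raise, e.g. Logoff
    table.clear()
    table.update(working)

def interrupted(_table):
    raise Logoff()

pages = {"owner": "user1", "frames": 3}
steps = [lambda t: t.update(owner=None), interrupted,
         lambda t: t.update(frames=0)]
try:
    apply_all_or_nothing(pages, steps)
except Logoff:
    pass
print(pages)   # unchanged: the partial update was never published
```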


6.2 ERROR HANDLING FAILURES

This is an important error category, comprising roughly 27% of all
system failures. A claim made for the VM/370 type of operating system
structure is that, because of the isolation of users and system
functions, reliability can be much better [Donovan 76]. One rationale for a
hierarchical system is that the "vertical" segregation of system
functions into a hierarchy and the "horizontal" isolation of users from each
other affords easier fault isolation and recovery. Conceivably,
offending users and components can be removed in many cases without loss of
the system. We agree, and feel that the relatively high reliability of
the SLAC system is due in part to the VM/370 design. However, some of
the resiliency expected is not implemented, causing error handling to be
involved in a large fraction of system crashes. We divide this category
as follows:

6.2.1 Hardware Induced Errors

About 22% of software failures (24 of 108) involved the failure of the
software to continue after a non-catastrophic hardware error. These
only include cases where it was determined that the system should have
been designed to continue, possibly but not necessarily, in a degraded
mode.


Previous studies show that system activity, especially I/O activity,
is strongly related to hardware failure rates [Iyer 82] [Castillo 81].
Since operating systems are required to react to such failures it can be
expected that more software failures will occur.

Input/Output Errors: It is clear that I/O errors are directly
related to the I/O rate or the amount of data being transferred. Since
a nontrivial fraction of all I/O can critically affect system operation,
the exposure to system failure can be expected to increase.

Example: Errors while transferring memory pages from or to an
external device can be catastrophic. Unfortunately, many systems do not
adopt a strategy of graceful degradation in such an area; the next
section addresses this question.

Microcode Errors: Most modern computer systems rely heavily on
microcode in various system components. In the IBM 3081 its use is
pervasive - controlling essentially all hardware components from the
operator's console to performing failure diagnosis [Reilly 82]. Microcode
even controls and monitors the 3081's power supplies and thermal state
in real time. Such a complex microcode system provides a variety of
rare states to be entered during intricate event sequences.

Example: The 3082 processor controller for the 3081 has been
responsible for a number of failures due to both microcode and hardware
failures.

6.2.2 Non Hardware Induced Errors

The results given earlier in this paper demonstrate that software
failures will occur more under high system stress. It therefore follows
that greater robustness is needed in handling error situations. About
21% of software failures (23 of 108) involved the detection of
inconsistencies in system data structures or a weak response to the failure of a
particular software component. In almost all of these cases it seemed
that the control program should have been able to sever or mend the
ailing component and either recover or degrade gracefully. In these
failures blame could not be placed on the hardware.

6.3  SOFTWARE  CONTROL  ERRORS 

This category corresponds generally to the classical meaning of a bug.
There are two levels of behavior in this category. The first involves
the discovery of latent errors; the second relates to the violation of
space or timing constraints.

6.3.1  Discovery  of  Latent  Errors 

A process inherent in the life of a mature production system is the
discovery of latent (or dormant) bugs [Musa 80]. The relation of this type
of error to workload is evident. Well used sections of code tend to be
more reliable simply because bugs have already been discovered and
removed. Under normal loads these sections tend to be heavily used and
the system remains reliable. However, during periods of stress or
uncommon workload patterns, rarely used code can be executed, leading to
the discovery of errors.

Example: A prime example of this phenomenon involved a section of
system initialization code to handle the case of finding a faulty
storage page frame. Since, in the period of over a year, the 3081 had not
encountered such an error, the code had never been executed. The first
time a defective frame was found, an obvious coding error was uncovered.
In this case the system could not even be restarted to repair the error.

6.3.2 Space and Time Violations

In a typical timesharing environment the variety of demands made on the
system (complexity of the load) is directly related to the number of
users on the system. Although it is not necessarily true that the
number of program states will increase with load, it is clear that the
number of timing and data structure states will. We observe that this
increase is greater than linear with the workload variables presented in
this paper. In fact, this mushrooming of states may explain the
exponential increase of failure hazard with load.

One practical way to study control failures is to classify the errors
into violations in space (e.g., overwriting storage, invalid operations)
and violations in time (e.g., simultaneous update, invalid sequence of
operations, insufficient locking of critical data). Experience has
shown that in a stable system such as VM/370 the space violations are
culled more quickly than timing violations because they:

1. are easier to understand;

2. usually manifest their effects immediately and disastrously;

3. tend to be unaffected by the dilation and contraction of time
scales caused by load.

On the other hand, timing related problems can linger in a system for
years and can be particularly sensitive to load variations. These
errors are more difficult to diagnose because specific load patterns may
be required to reproduce the problem and because the manifestation of
timing bugs is usually subtle and complex.

Example:  At  times  a  failure  will  occur  that  is  a  combination  of  both 
a  time  and  a  space  violation.  Typically  a  complex  set  of  events  will 
lead  to  a  timing  error,  which  triggers  the  overwriting  of  an  area  of 
storage.  Such  bugs  are  extremely  difficult  to  diagnose. 

6.4 GRADUAL DECAY OF THE SYSTEM

A  new  class  of  non-catastrophic  errors  begins  to  surface  as  a  system 
becomes  more  reliable.  These  have  to  do  with  the  gradual  loss  of  system 
resources, such as memory frames or free disk blocks, due to rare
housekeeping errors. Since, typically, these resources are redefined at each
system  reload,  in  a  relatively  unreliable  system  their  loss  may  never  be 
noticed  by  the  system  or  its  users.  If,  on  the  other  hand,  the  system 
runs  for  weeks  without  failure,  then  the  gradual  loss  can  become  notice¬ 
able. 

Example:  In  the  SLAC  system,  an  unknown  bug  had  existed  for  years 
that  allowed  temporary  disk  space  to  be  lost  in  small  increments  over  a 
period  of  time.  After  a  10  day  period  without  a  system  reload,  users 
began  to  complain  about  the  lack  of  scratch  disk  space.  Investigation 
showed that the sum of allocated and free space did not equal the
amount "known" to be available, and the error was corrected.
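
The consistency check described in this example can be run periodically rather than waiting for user complaints. The sketch below assumes that blocks are identified by integers and that the allocated and free lists can be enumerated; the function name and the sample values are hypothetical.

```python
def audit_blocks(all_blocks, allocated, free):
    """Return (leaked, double_counted) sets of block ids."""
    allocated, free = set(allocated), set(free)
    # Leaked blocks are known to exist but are neither allocated nor free.
    leaked = set(all_blocks) - allocated - free
    # A block on both lists signals a different housekeeping error.
    double_counted = allocated & free
    return leaked, double_counted

# Hypothetical disk with 10 blocks, 3 of which have been lost.
all_blocks = range(10)
allocated = [0, 1, 2]
free = [3, 4, 5, 6]
leaked, double_counted = audit_blocks(all_blocks, allocated, free)
print(sorted(leaked), sorted(double_counted))
```

Reporting the leaked set, rather than only a mismatch in totals, also gives the support staff a starting point for finding the housekeeping error.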


7.  CONCLUSION 


It has been the purpose of this paper to present an analysis of software
related system failures on the IBM 3081 at SLAC. We find three broad
categories of failures: error handling, program control or logic, and
hardware-related. A statistical analysis of these failure modes shows
(not unexpectedly) a decreasing failure rate with time. This is
especially true in the early part of the study. Notwithstanding the
decreasing failure rate with time, we find that the occurrence of the
failures is strongly correlated with the type and level of workload
prior to the occurrence of a failure. For example, it is shown that the
risk of a software related failure increases in a non-linear fashion
with amount of interactive processing, as measured by parameters such as
the paging rate and the amount of overhead. The overall CPU execution
rate, though measured to be close to 100% most of the time, is not found
to correlate with the occurrence of failures. We propose a load-hazard
model to statistically measure the above effects. Finally the paper
offers conjectures on the observed phenomenon.

As  with  any  statistical  analysis,  this  is  not  proof  in  itself.  How¬ 
ever,  the  increasing  body  of  evidence  accumulated  on  different  computers 
with differing load and failure patterns shows that workload should be
considered  as  a  factor  in  reliability.  The  design  of  computer  systems 
will  be  greatly  aided  if  this  type  of  analysis  can  help  uncover  cause 
and  effect  relationships  in  software  failures. 

APPENDIX


Load, Failure, and Hazard Plots
For three error types:

Control (CTL), Error Handling (ERH), Software (SW)


The  top  half  of  each  page  contains  the  Conditional  Failure, 
Load,  and  Joint  Failure  distributions.  The  bottom  half 
shows  the  corresponding  Fundamental  and  Apparent  Hazard 
functions.  The  error  type  is  indicated  in  small  type  just 
below  each  plot;  they  are  arranged  in  groups  of  three  pages 
each. 